Import libraries

Dataset import

Dataset Information

Unique Values
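
As a minimal sketch of this step, the unique values per column can be counted with `nunique`. The small frame below is a hypothetical stand-in for the insurance dataset, and its column names (`age`, `sex`, `bmi`, `smoker`, `charges`) are assumptions based on the variables discussed in this notebook.

```python
import pandas as pd

# Hypothetical sample; column names and values are illustrative assumptions
df = pd.DataFrame({
    "age": [19, 33, 45, 33],
    "sex": ["female", "male", "male", "female"],
    "bmi": [27.9, 22.7, 30.1, 22.7],
    "smoker": ["yes", "no", "no", "no"],
    "charges": [17000.0, 1700.0, 7000.0, 4400.0],
})

# Number of distinct values in each column
unique_counts = df.nunique()
print(unique_counts)

# Distinct labels of the categorical columns
for col in ["sex", "smoker"]:
    print(col, df[col].unique().tolist())
```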

Pie Plot

Scatter Plots

Is the price of health insurance for smokers higher?
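
One way to answer this question numerically, besides the plots, is to compare summary statistics of the charges per smoker group. This is only a sketch on made-up values; the column names `smoker` and `charges` are assumptions.

```python
import pandas as pd

# Made-up values for illustration only
df = pd.DataFrame({
    "smoker": ["yes", "no", "yes", "no", "no", "yes"],
    "charges": [30000.0, 4000.0, 35000.0, 6000.0, 5000.0, 25000.0],
})

# Mean, median and group size of the charges for each smoker label
summary = df.groupby("smoker")["charges"].agg(["mean", "median", "count"])
print(summary)

# How many times higher the mean charge is for smokers in this sample
ratio = summary.loc["yes", "mean"] / summary.loc["no", "mean"]
print(ratio)
```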

For smokers, the price of insurance rises roughly linearly with BMI; for non-smokers it stays roughly constant. The data possibly contains outliers.
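
The difference between the two groups can also be checked with a per-group correlation. The sketch below uses synthetic data that is generated to behave this way (charges depend on BMI only for smokers), so it illustrates the check rather than the notebook's actual result. Column names are assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
bmi = rng.uniform(18, 40, size=n)
smoker = np.where(np.arange(n) % 2 == 0, "yes", "no")
noise = rng.normal(0, 1000, size=n)

# Synthetic assumption: charges rise with BMI only for smokers
charges = np.where(smoker == "yes", 10000 + 900 * bmi, 8000.0) + noise
df = pd.DataFrame({"bmi": bmi, "smoker": smoker, "charges": charges})

# Pearson correlation between BMI and charges inside each group
corr_by_group = {
    label: group["bmi"].corr(group["charges"])
    for label, group in df.groupby("smoker")
}
print(corr_by_group)
```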

For users who do not smoke, the price of insurance follows a linear trend with age.

The histogram and the box plot confirm the presence of outliers, so we have to give them special processing.
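
A common way to turn what the box plot shows into numbers is the 1.5 × IQR rule, which is also how box plot whiskers are usually drawn. The sketch below assumes that rule; the notebook's own limits may be computed differently, and the BMI values are made up.

```python
import pandas as pd

# Made-up BMI values with one extreme observation
bmi = pd.Series([18.5, 22.0, 24.3, 26.1, 27.8, 29.4, 31.0, 33.2, 52.0])

# Interquartile range and the limits derived from it
q1, q3 = bmi.quantile(0.25), bmi.quantile(0.75)
iqr = q3 - q1
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# Values outside the limits are flagged as outliers
outliers = bmi[(bmi < lower_limit) | (bmi > upper_limit)]
print(lower_limit, upper_limit)
print(outliers.tolist())
```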

Technically, we can treat outliers the same way as missing values.

The interactive graphics reveal values that are out of the ordinary, including some with a very low BMI. They also show that men have a higher BMI, since they tend to have a larger physical build.

Feature engineering

Upper limit

Replace values greater than the upper limit
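
The notebook does not state what the values are replaced with. One plausible reading is capping them at the upper limit, which can be sketched with `Series.clip`. The threshold below is a hypothetical number, not the one used in the notebook.

```python
import pandas as pd

bmi = pd.Series([22.0, 27.5, 31.0, 49.0, 53.1])
upper_limit = 47.0  # hypothetical threshold, e.g. taken from an IQR rule

# Values above the limit are set to the limit; the rest are unchanged
capped = bmi.clip(upper=upper_limit)
print(capped.tolist())
```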

Lower limit

Split: smoker = no

Upper limit

We transform the outliers to null
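
A minimal sketch of this conversion, assuming the outliers are the charges above an upper limit: `Series.mask` replaces every value where the condition holds with NaN. The limit and the data are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "age": [20, 25, 30, 35, 40, 45],
    "charges": [2000.0, 2500.0, 3000.0, 21000.0, 4000.0, 4500.0],
})
upper_limit = 10000.0  # hypothetical threshold

# Values above the limit become NaN; all other values are kept
df["charges"] = df["charges"].mask(df["charges"] > upper_limit)
print(df)
```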

We calculate the percentage of null values
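
The percentage of null values per column can be computed as the mean of the boolean null mask, multiplied by 100. The frame below is a small made-up example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [20, 25, 30, 35],
    "bmi": [22.0, np.nan, 30.5, 28.0],
    "charges": [2000.0, np.nan, np.nan, 4000.0],
})

# isna() gives True/False per cell; the column mean is the share of nulls
null_pct = df.isna().mean() * 100
print(null_pct.round(1))
```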

We still see outliers, so I will divide the dataset by the age of the user for better cleaning.

We divide according to age.
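
The idea can be sketched as follows: a charge that looks normal for an older user may be an outlier for a younger one, so the limits are computed inside each age band. The age bands, the 1.5 × IQR rule and the helper `mask_outliers` are all assumptions made for illustration; the notebook's actual cut points are not stated.

```python
import pandas as pd

df = pd.DataFrame({
    "age": [19, 22, 25, 28, 30, 48, 52, 55, 58, 60],
    "charges": [1500.0, 1700.0, 1900.0, 2100.0, 9000.0,
                8000.0, 8500.0, 9000.0, 9500.0, 30000.0],
})

# Hypothetical age bands: (17, 40] and (40, 64]
df["age_group"] = pd.cut(df["age"], bins=[17, 40, 64], labels=["18-40", "41-64"])

def mask_outliers(s):
    """Set values outside the 1.5 * IQR limits of the series to NaN."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return s.where(s.between(q1 - 1.5 * iqr, q3 + 1.5 * iqr))

# Limits are computed separately inside each age band
df["charges_clean"] = (
    df.groupby("age_group", observed=True)["charges"].transform(mask_outliers)
)
print(df)
```

In this sample a charge of 9000 is removed for the 30-year-old but kept for the 55-year-old, which a single global limit would not do.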

Now the dataset is free of outliers. To avoid data loss, we are going to create a model to substitute the null values, which should give closer estimates than replacing them with a single statistical measure.

Replace null values

I will obtain the replacement values from a linear regression model, since age and insurance charges follow a linear trend for non-smokers.

Split data into train and test
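
A minimal sketch of the split, assuming scikit-learn's `train_test_split` and a single `age` feature. Only rows with a known charge can be used for training, so the null rows are dropped first. The data is synthetic, and the 80/20 ratio is an assumption.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic non-smoker data with a linear age trend (illustration only)
rng = np.random.default_rng(42)
age = rng.integers(18, 65, size=100)
charges = 250.0 * age + rng.normal(0, 300, size=100)
df = pd.DataFrame({"age": age, "charges": charges})

# Pretend every tenth charge was an outlier that has been set to null
df.loc[df.index % 10 == 0, "charges"] = np.nan

# Train and evaluate only on rows where the target is known
known = df.dropna(subset=["charges"])
X_train, X_test, y_train, y_test = train_test_split(
    known[["age"]], known["charges"], test_size=0.2, random_state=0
)
print(len(X_train), len(X_test))
```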

Linear Model

Cross validation
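
As a sketch, the linear model can be scored with k-fold cross validation through scikit-learn's `cross_val_score`. The number of folds, the R² scoring and the synthetic data are assumptions, so the score printed here is not the notebook's result.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic data with a linear age trend (illustration only)
rng = np.random.default_rng(1)
age = rng.integers(18, 65, size=200).astype(float)
charges = 250.0 * age + rng.normal(0, 300, size=200)
X = age.reshape(-1, 1)  # scikit-learn expects a 2-D feature array

# One R^2 score per fold; the mean summarises how well the model generalizes
scores = cross_val_score(LinearRegression(), X, charges, cv=5, scoring="r2")
print(scores)
print(scores.mean())
```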

The generalization score was quite high: the linear model can describe about 97% of the observations. We can therefore use it to substitute missing values and avoid excessive data loss.

Substitution of null values
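
The substitution itself can be sketched like this: fit the model on the rows with a known charge, then write its predictions into the rows where the charge is null. The data is made up and exactly linear, and using `age` as the only feature is an assumption.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({
    "age": [20, 30, 40, 50, 60, 35, 55],
    "charges": [2000.0, 3000.0, 4000.0, 5000.0, 6000.0, np.nan, np.nan],
})

# Rows with a known target train the model
known = df[df["charges"].notna()]
missing = df["charges"].isna()

model = LinearRegression().fit(known[["age"]], known["charges"])

# Predictions replace only the null entries; known values are untouched
df.loc[missing, "charges"] = model.predict(df.loc[missing, ["age"]])
print(df)
```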

Outliers for smokers: we convert them to missing values so we can replace them in the same way as before.

Substitution of null values

Substitution of null values

We create a dataset with the clean data

We save the dataset with the clean data
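
A minimal sketch of these last two steps, assuming the cleaned non-smoker and smoker subsets are held in two frames that are joined with `pd.concat` and written with `to_csv`. The frame contents and the file name `insurance_clean.csv` are hypothetical; a temporary directory is used here so the example does not overwrite anything.

```python
import os
import tempfile

import pandas as pd

# Hypothetical cleaned subsets
non_smokers = pd.DataFrame(
    {"age": [25, 40], "smoker": ["no", "no"], "charges": [2500.0, 4000.0]}
)
smokers = pd.DataFrame(
    {"age": [30, 50], "smoker": ["yes", "yes"], "charges": [20000.0, 30000.0]}
)

# Stack both subsets into one dataset with a fresh index
clean = pd.concat([non_smokers, smokers], ignore_index=True)

# Save without the index column, then read back to verify the file
out_path = os.path.join(tempfile.mkdtemp(), "insurance_clean.csv")
clean.to_csv(out_path, index=False)
reloaded = pd.read_csv(out_path)
print(reloaded)
```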